Cross Validation





Kerry Back

  • Cross-validation (CV) is a way to choose optimal hyperparameters using the training data
  • Split the training data into subsets, e.g., A, B, C, D, E
  • Define a finite set of hyperparameter combinations (a grid) to choose from
    • Example: {"max_depth": [3, 4], "learning_rate": [0.05, 0.1]}
    • Example: {"hidden_layer_sizes": [[4, 2], [8, 4, 2], [16, 8, 4]]}

  • Use one of the subsets (e.g., A) as the validation set
  • Train with each of the hyperparameter combinations on the union of the remaining subsets (e.g., B \(\cup\) C \(\cup\) D \(\cup\) E)
  • Score each trained model on A
  • Repeat with B as the validation set, etc.
  • For each hyperparameter combination, end up with as many validation scores as there are subsets

  • Average the validation scores to get a single score for each hyperparameter combination
  • Choose the hyperparameters with the highest average score
  • All of this together is “search over the grid using cross-validation to find the best hyperparameters”
  • It is implemented by scikit-learn’s GridSearchCV class
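
The procedure above can be sketched by hand with scikit-learn's KFold and ParameterGrid, using synthetic data as a stand-in for the training set (all variable names here are hypothetical):

```python
# Manual grid search with 5-fold cross-validation, mirroring the steps above.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, ParameterGrid

# Synthetic stand-in for the training data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X[:, 0] + rng.normal(scale=0.5, size=200)

param_grid = {"max_depth": [3, 4], "learning_rate": [0.05, 0.1]}
kf = KFold(n_splits=5)  # the subsets A, B, C, D, E

results = {}
for params in ParameterGrid(param_grid):
    scores = []
    # Each fold in turn serves as the validation set; the rest is for training
    for train_idx, val_idx in kf.split(X):
        model = GradientBoostingRegressor(**params)
        model.fit(X[train_idx], y[train_idx])
        scores.append(model.score(X[val_idx], y[val_idx]))
    # Average the validation scores for this hyperparameter combination
    results[tuple(sorted(params.items()))] = np.mean(scores)

# Choose the combination with the highest average score
best = max(results, key=results.get)
```

GridSearchCV automates exactly this loop, including refitting the winning model on the full training data.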

Example

  • Same data as in 3a-trees
    • agr, bm, idiovol, mom12m, roeq
    • data = 2021-12 (training data)
  • Quantile transform features and ret
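
A minimal sketch of the quantile-transform step using scikit-learn's QuantileTransformer; the random data here is a hypothetical stand-in for the actual feature and return columns:

```python
# Quantile-transform the features and the return column to a normal shape.
import numpy as np
import pandas as pd
from sklearn.preprocessing import QuantileTransformer

# Synthetic stand-in for the training data
rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.normal(size=(200, 6)),
    columns=["agr", "bm", "idiovol", "mom12m", "roeq", "ret"],
)

qt = QuantileTransformer(output_distribution="normal", n_quantiles=100)
transformed = pd.DataFrame(qt.fit_transform(df), columns=df.columns)

Xtrain = transformed[["agr", "bm", "idiovol", "mom12m", "roeq"]]
ytrain = transformed["ret"]
```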

Cross validate gradient boosting

import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

param_grid = {
  "max_depth": [3, 4], 
  "learning_rate": [0.05, 0.1]
}

cv = GridSearchCV(
  estimator=GradientBoostingRegressor(),
  param_grid=param_grid,
)

_ = cv.fit(Xtrain, ytrain)
pd.DataFrame(cv.cv_results_).iloc[:, 4:]

param_learning_rate param_max_depth params split0_test_score split1_test_score split2_test_score split3_test_score split4_test_score mean_test_score std_test_score rank_test_score
0 0.05 3 {'learning_rate': 0.05, 'max_depth': 3} 0.217789 0.201953 0.126714 0.050924 0.173691 0.154214 0.060208 1
1 0.05 4 {'learning_rate': 0.05, 'max_depth': 4} 0.192301 0.203017 0.113426 0.021100 0.196805 0.145330 0.070192 2
2 0.1 3 {'learning_rate': 0.1, 'max_depth': 3} 0.176347 0.185516 0.119217 0.034083 0.152840 0.133601 0.054778 3
3 0.1 4 {'learning_rate': 0.1, 'max_depth': 4} 0.169261 0.181719 0.084938 -0.009710 0.163443 0.117930 0.072327 4
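
Besides `cv_results_`, a fitted GridSearchCV exposes the winning combination directly. A hedged sketch, with synthetic data standing in for Xtrain and ytrain:

```python
# After fitting, GridSearchCV stores the best combination and a refit model.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X[:, 0] + rng.normal(scale=0.5, size=200)

cv = GridSearchCV(
    GradientBoostingRegressor(),
    param_grid={"max_depth": [3, 4], "learning_rate": [0.05, 0.1]},
)
cv.fit(X, y)

print(cv.best_params_)  # winning hyperparameter combination
print(cv.best_score_)   # its mean validation score
preds = cv.predict(X)   # predictions from the best model, refit on all data
```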

Cross validate AdaBoost

from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import AdaBoostRegressor

model = AdaBoostRegressor(
    base_estimator=DecisionTreeRegressor()
)

param_grid = {
  "base_estimator__random_state": [0],
  "base_estimator__max_depth": [3, 4],
  "learning_rate": [0.1, 0.2]
}

cv = GridSearchCV(
  model,
  param_grid=param_grid
)

cv.fit(Xtrain, ytrain)
pd.DataFrame(cv.cv_results_).iloc[:, 4:]
param_base_estimator__max_depth param_base_estimator__random_state param_learning_rate params split0_test_score split1_test_score split2_test_score split3_test_score split4_test_score mean_test_score std_test_score rank_test_score
0 3 0 0.1 {'base_estimator__max_depth': 3, 'base_estimat... 0.165694 0.172555 0.119074 0.066373 0.145433 0.133826 0.038517 2
1 3 0 0.2 {'base_estimator__max_depth': 3, 'base_estimat... 0.124119 0.141410 0.108078 0.062630 0.079141 0.103076 0.028796 4
2 4 0 0.1 {'base_estimator__max_depth': 4, 'base_estimat... 0.195106 0.191681 0.122983 0.074499 0.132485 0.143351 0.045361 1
3 4 0 0.2 {'base_estimator__max_depth': 4, 'base_estimat... 0.148737 0.165503 0.115922 0.081890 0.079968 0.118404 0.034511 3